mcnemar: McNemar's test for classifier comparisons

https://rasbt.github.io/mlxtend/user_guide/evaluate/mcnemar/

McNemar's Test (sometimes also called "within-subjects chi-squared test") is a statistical test for paired nominal data.

(This definition appears to follow the Wikipedia entry; its Japanese translation was consulted for these notes.)

In the context of machine learning (or statistical) models, we can use McNemar's Test to compare the predictive accuracy of two models.

McNemar's test is based on a 2 times 2 contingency table of the two models' predictions.

Because the contingency table is 2×2, the degrees of freedom are (2-1)(2-1) = 1.
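As a minimal sketch of how such a 2×2 table could be tallied by hand, assuming the cell layout [[a, b], [c, d]] where b counts samples that model 1 gets right and model 2 gets wrong, and c the reverse. The function name `contingency_table` and the toy labels are made up for illustration; this is not the mlxtend implementation.

```python
def contingency_table(y_true, y_model1, y_model2):
    """Tally a 2x2 table [[a, b], [c, d]] for two classifiers.

    a: both correct, b: only model 1 correct,
    c: only model 2 correct, d: both wrong.
    (Hypothetical helper for illustration.)
    """
    a = b = c = d = 0
    for target, pred1, pred2 in zip(y_true, y_model1, y_model2):
        ok1 = pred1 == target
        ok2 = pred2 == target
        if ok1 and ok2:
            a += 1
        elif ok1:
            b += 1
        elif ok2:
            c += 1
        else:
            d += 1
    return [[a, b], [c, d]]


# Toy data (invented for this sketch)
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_model1 = [0, 1, 0, 0, 0, 1, 1, 0, 1, 1]
y_model2 = [0, 0, 1, 1, 0, 1, 1, 0, 1, 1]

table = contingency_table(y_true, y_model1, y_model2)
print(table)  # [[6, 2], [1, 1]]
```

Only the off-diagonal cells b and c (the samples on which the two models disagree) enter the test statistic.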

McNemar's test statistic

Null hypothesis

we formulate the null hypothesis that the probabilities p(b) and p(c) are the same

That is, the two probabilities p(model 1 is correct and model 2 is wrong) and p(model 1 is wrong and model 2 is correct) are the same.

in simplified terms: None of the two models performs better than the other.

Alternative hypothesis

the alternative hypothesis is that the performances of the two models are not equal.

Chi-squared statistic

χ2 = (b-c)**2/(b+c)

If the sum of cell c and b is sufficiently large, the χ2 value follows a chi-squared distribution with one degree of freedom.

After setting a significance threshold, e.g., α=0.05, we can compute the p-value -- assuming that the null hypothesis is true, the p-value is the probability of observing this empirical (or a larger) chi-squared value.

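If the p-value is lower than our chosen significance level, we can reject the null hypothesis that the two models' performances are equal.

(Here the chosen significance level is α=0.05, as an example.)

(The reasoning: if a result at least this extreme would occur with probability below 0.05 (5%) when the null hypothesis holds, it is more plausible that the premise is wrong than that we happened to observe it by chance.)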
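The statistic and the decision rule above can be sketched with the standard library alone. For one degree of freedom, the upper tail of the chi-squared distribution equals erfc(sqrt(χ2/2)), so no statistics package is needed. The function name `mcnemar_uncorrected` and the counts b=25, c=15 are illustrative assumptions, and this uncorrected form is not what mlxtend computes by default.

```python
import math


def mcnemar_uncorrected(b, c):
    """Uncorrected McNemar statistic and its p-value (1 degree of freedom).

    Hypothetical helper for illustration only.
    """
    chi2 = (b - c) ** 2 / (b + c)
    # Survival function of chi-squared with 1 dof: P(X > chi2)
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p


alpha = 0.05
chi2, p = mcnemar_uncorrected(b=25, c=15)
reject_null = p < alpha
print(chi2, p, reject_null)
```

With these counts χ2 is 100/40 = 2.5 and the p-value is above 0.05, so the null hypothesis is not rejected.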

Continuity correction

>Edwards proposed a continuity corrected version, which is the more commonly used variant today:

This corresponds to the corrected argument of the mcnemar function (corrected=True is the default).
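Edwards' continuity correction subtracts 1 from |b - c| before squaring, i.e. χ2 = (|b - c| - 1)**2 / (b + c). A rough standard-library sketch follows; the name `mcnemar_corrected` is invented here and this is not the mlxtend source. The counts b=25, c=15 are the off-diagonal cells of the tb_b table used later in these notes.

```python
import math


def mcnemar_corrected(b, c):
    """Continuity-corrected McNemar statistic and p-value (1 dof).

    Hypothetical helper for illustration only.
    """
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Upper tail of chi-squared with 1 degree of freedom
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p


chi2_corr, p_corr = mcnemar_corrected(b=25, c=15)
print(chi2_corr)  # 2.025, i.e. (10 - 1)**2 / 40
print(p_corr)     # roughly 0.1547
```

The correction always shrinks the statistic, so the corrected p-value is never smaller than the uncorrected one.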

Exact p-value

an exact binomial test is recommended for small sample sizes (b+c<25)

the factor 2 is used to compute the two-sided p-value.

This corresponds to the exact argument of the mcnemar function (pass exact=True).
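For small b + c, the two-sided exact p-value can be written as twice a binomial tail with n = b + c trials and success probability 0.5. The sketch below sums the lower tail up to min(b, c), which by symmetry equals the upper tail from max(b, c), and caps the result at 1. The name `mcnemar_exact_p` is hypothetical; the counts b=11, c=1 are the off-diagonal cells of the tb_a table used later in these notes.

```python
import math


def mcnemar_exact_p(b, c):
    """Two-sided exact binomial p-value for McNemar's test.

    Hypothetical helper for illustration only.
    """
    n = b + c
    k = min(b, c)
    # P(X <= k) for X ~ Binomial(n, 0.5), computed with exact integers
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    # The factor 2 makes the test two-sided; cap at 1
    return min(1.0, 2.0 * tail)


p_exact = mcnemar_exact_p(b=11, c=1)
print(p_exact)  # 0.00634765625, i.e. 2 * (1 + 12) / 4096
```

Because the tail is accumulated as an exact integer before dividing, there is no rounding drift for the small n where the exact test is recommended.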

The mcnemar_table function builds the array that represents the 2×2 contingency table

👉mcnemar_table: Contingency table for McNemar's test

Pass its output to the mcnemar function

code:example2.py
>>> import numpy as np
>>> from mlxtend.evaluate import mcnemar
>>> tb_b = np.array([[9945, 25], [15, 15]])
>>> tb_b
array([[9945,   25],
       [  15,   15]])
>>> # significance level alpha = 0.05
>>> chi2, p = mcnemar(tb_b, corrected=True)  # corrected=True is the default
>>> chi2
2.025
>>> p  # p > alpha, so the null hypothesis (the two models perform equally) is not rejected
0.15472892348537437

code:example3.py
>>> import numpy as np
>>> from mlxtend.evaluate import mcnemar
>>> tb_a = np.array([[9959, 11], [1, 29]])
>>> tb_a
array([[9959,   11],
       [   1,   29]])
>>> # significance level alpha = 0.05
>>> # 11 + 1 < 25, so exact=True is specified: "we need to compute the exact p-value from the binomial distribution:"
>>> chi2, p = mcnemar(tb_a, exact=True)
>>> chi2  # not None, which contradicts the documentation's "if exact=True (default: False), chi2 is None"
>>> p  # p < alpha, so the null hypothesis is rejected: the two models' performances are not equal
0.00634765625
>>> _, p = mcnemar(tb_a, exact=True)
>>> p
0.00634765625

In mlxtend 0.21.0, chi2 is not None when exact=True

The test at https://github.com/rasbt/mlxtend/blob/master/mlxtend/evaluate/tests/test_mcnemar_test.py#L38-L57 asserts against None

But the chi2 checked there is not the return value! (So this behavior is not actually tested.) https://github.com/rasbt/mlxtend/blob/master/mlxtend/evaluate/tests/test_mcnemar_test.py#L44

Returning tb[1,0] looks like a bug

code:bugged_behavior.py
>>> import numpy as np
>>> from mlxtend.evaluate import mcnemar
>>> tb = np.array([[101, 121], [59, 33]])  # the example from the test
>>> mcnemar(tb, exact=True)
(59, 4.4344492637551645e-06)

The implementation assigns chi2 = min(b, c), so it does not return None

https://github.com/rasbt/mlxtend/blob/v0.21.0/mlxtend/evaluate/mcnemar.py#L216

Note on the sampling error of the difference between correlated proportions or percentages (footnote 1)

Implementation of mlxtend.evaluate.mcnemar